Parallel Implementation of Multiple Linear Regression Algorithm Based on MapReduce

Authors

  • Moufida Rehab Adjout
  • Faouzi Boufares
Abstract

Traditional business activities generate large amounts of data, creating data repositories ranging from terabytes to petabytes in size. This information cannot be practically analyzed on a single commodity computer because the data is too large to fit in memory. Processing data of this size therefore requires high-performance analytical systems running in distributed environments. The size of the data also constrains the types of algorithms that can be considered, so standard analytics algorithms need to be adapted to take advantage of cloud computing models, which provide scalability and flexibility. This paper introduces a new distributed training method that combines MapReduce, a widely used framework for big data, with the traditional structure of multiple linear regression. Parallel processing of multiple linear regression is based on the QR decomposition and the ordinary least squares method, adapted to MapReduce. Our platform is deployed on the Amazon EMR cloud service. Experimental results demonstrate that our parallel version of multiple linear regression can efficiently handle very large datasets on commodity hardware, with good performance across different evaluation criteria, including the number, size, and structure of machines in the cluster.
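
The abstract does not spell out how the QR decomposition is distributed, so the following is only a minimal sketch of one common pattern (a tall-and-skinny QR, assumed here rather than taken from the paper). Each simulated mapper reduces its block of the augmented matrix [X | y] to a small triangular factor, a simulated reducer stacks two factors and re-factors them, and the least-squares coefficients come from back substitution. The names `r_factor`, `map_block`, `reduce_pair`, and `back_substitute` are hypothetical, the toy data is made up for illustration, and the map and reduce phases run in-process instead of on Hadoop or Amazon EMR.

```python
import math
from functools import reduce
from typing import List

Matrix = List[List[float]]


def r_factor(a: Matrix) -> Matrix:
    """Return the upper-triangular R of a = QR (modified Gram-Schmidt, Q discarded)."""
    m, n = len(a), len(a[0])
    cols = [[a[i][j] for i in range(m)] for j in range(n)]  # column vectors of a
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        norm = math.sqrt(sum(v * v for v in cols[j]))
        r[j][j] = norm
        # A zero column contributes a zero row to R; avoid dividing by zero.
        q = [v / norm for v in cols[j]] if norm > 0.0 else [0.0] * m
        for k in range(j + 1, n):
            r[j][k] = sum(qi * ci for qi, ci in zip(q, cols[k]))
            cols[k] = [ci - r[j][k] * qi for ci, qi in zip(cols[k], q)]
    return r


def map_block(block: Matrix) -> Matrix:
    """Map step (assumed): compress one block of rows [x..., y] to its R factor."""
    return r_factor(block)


def reduce_pair(r1: Matrix, r2: Matrix) -> Matrix:
    """Reduce step (assumed): stack two R factors row-wise and re-factor."""
    return r_factor(r1 + r2)


def back_substitute(r_aug: Matrix) -> List[float]:
    """Solve R beta = z, where z is the last column of the augmented R factor."""
    p = len(r_aug) - 1  # number of regression coefficients
    beta = [0.0] * p
    for i in reversed(range(p)):
        s = r_aug[i][p] - sum(r_aug[i][k] * beta[k] for k in range(i + 1, p))
        beta[i] = s / r_aug[i][i]
    return beta


# Toy data (illustrative only): y = 1 + 2*x1 + 3*x2, with an intercept column.
points = [(0, 1), (1, 0), (2, 3), (3, 1), (4, 5), (5, 2),
          (6, 4), (7, 7), (8, 3), (9, 6), (10, 2), (11, 8)]
rows = [[1.0, float(x1), float(x2), 1.0 + 2.0 * x1 + 3.0 * x2] for x1, x2 in points]

# Split the rows into blocks, as if each block lived on a different worker.
blocks = [rows[i:i + 4] for i in range(0, len(rows), 4)]

r_final = reduce(reduce_pair, map(map_block, blocks))
beta = back_substitute(r_final)
print([round(b, 6) for b in beta])
```

Because each mapper emits only a (p+1) x (p+1) triangle, the data shuffled between the map and reduce phases does not grow with the number of rows. On the noiseless toy data the recovered coefficients should be close to the generating values 1, 2, and 3.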

Similar resources

QSRR Study of Organic Dyes by Multiple Linear Regression Method Based on Genetic Algorithm (GA–MLR)

Quantitative structure-retention relationships (QSRRs) are used to correlate paper chromatographic retention factors of disperse dyes with theoretical molecular descriptors. A data set of 23 compounds with known RF values was used. A genetic algorithm-multiple linear regression (GA-MLR) model with three selected theoretical descriptors was obtained. The stability and predictability of the ...

Implementation of the direction of arrival estimation algorithms by means of GPU-parallel processing in the CUDA environment (Research Article)

Direction-of-arrival (DOA) estimation of audio signals is critical in different areas, including electronic warfare, sonar, etc. Beamforming methods such as Minimum Variance Distortionless Response (MVDR) and Delay-and-Sum (DAS), and the subspace-based Multiple Signal Classification (MUSIC) method, are the best-known DOA estimation techniques. These methods have high computational complexity. Hence using...

An evaluation of the performance of parallel database operators using Phoenix MapReduce

The database join operator is the most expensive operator of the relational algebra operators. Many highly efficient sequential and parallel operators exist, based on several core techniques: sort-merge, hash and nested-loops. We present the design and implementation of two parallel operators: an equi-join and a grouping aggregation. They utilise the emerging MapReduce paradigm, specifically a ...

Simple one-pass algorithm for penalized linear regression with cross-validation on MapReduce

In this paper, we propose a one-pass algorithm on MapReduce for penalized linear regression $f_\lambda(\alpha, \beta) = \|Y - \alpha\mathbf{1} - X\beta\|_2^2 + p_\lambda(\beta)$, where $\alpha$ is the intercept, which can be omitted depending on the application, $\beta$ is the coefficient vector, and $p_\lambda$ is the penalty function with penalty parameter $\lambda$. $f_\lambda(\alpha, \beta)$ includes interesting classes such as Lasso, Ridge regression and Elastic-net. Compared to latest iterativ...

Large-Scale Non-Linear Regression within the MapReduce Framework

Large-scale Non-linear Regression within the MapReduce Framework. By: Ahmed Khademzadeh. Thesis Advisor: Philip Chan, Ph.D. Regression models have many applications in real-world problems such as finance, epidemiology, environmental science, etc. Big datasets are everywhere these days, and bigger datasets would help us to construct better models from the data. The issue with big datasets is that...

Publication date: 2014